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Abstract. Let X[0..n — 1] and Y[0..m — 1] be two sorted arrays, and define the m x n matrix A 
by = + Y[j]. Frederickson and Jolinson [7] gave an efficient algoritiim for selecting tlie 

feth smallest element from A. We show how to make this algorithm lO-efhcient. Our cache-oblivious 
algorithm performs 0{{m + n)/B) lOs, where B is the block size of memory transfers. 



1. Introduction 

Let S be a multi-set of elements from a totally ordered universe and let k be an integer in the 
range 1 ^ k ^ \S\. The selection problem is to find a A;th smallest element of S, that is, an element 
X £ S that is kth in some non-decreasing total ordering of S. Selection is a fundamental problem in 
computer science and a key building block of many algorithms. Selection is trivial when S is sorted, 
but when S is not given in sorted order it becomes more challenging. A classical divide-and-conquer 
algorithm [HIS] solves the selection problem for unsorted inputs in OdSI) time. 

Often, the input is naturally organized as a two-dimensional matrix A with m rows and n columns. 
Using the classical algorithm one can perform selection in A in 0{mn) time, which is optimal in the 
worst case. When the rows and columns of the matrix are sorted, however, one can do much better. 
Frederickson and Johnson [71 18] gave an algorithm for this case — we will call it the FJ-algorithm 
from now on — that runs in 0(mlg(2n/m)) time; here we assume without loss of generality that 
m ^ n. Note that when m = n the running time is simply 0(n). 

In some applications the matrix A is defined succinctly by the Cartesian product of two given 
vectors X[0..n — 1] and Y[0..m — 1]. We are interested in the case where A = X + Y , that is, 

Am=X[i] + Y[j] 

where X and Y are sorted. (The symbol can mean any monotone binary operator.) Since 
X and Y are sorted, the rows and columns of A are sorted. Hence, one can perform selection in 
A in 0(mlg(2n/m)) time by FJ-algorithm. Selection in such sorted X -\-Y matrices is used as a 
subroutine in several other algorithms — see [H El [6l [TOl [131 US] for some examples. 

The FJ-algorithm is efficient in terms of CPU computation time. Unfortunately, it is not efficient 
when it comes to 10 behavior, because it accesses elements of the input arrays X and Y non- 
sequentially, in a pattern that does not exhibit locality of reference. This is the goal of our paper: 
to develop a variant of the algorithm that has better 10 behavior. 

The input- output complexity, or 10- complexity, of an algorithm is usually analyzed in the external- 
memory model introduced by Aggarwal and Vitter [2] . In this model the memory consists of two 
levels: a fast memory and a slow memory. The fast memory can store up to M words and the slow 
memory has unlimited storage capacity. Data is stored in the slow memory in blocks of size B. To 
be able to do computations on data in the slow memory, that data first has to be brought into the 
fast memory; data which is evicted from fast memory (to make room for other data) needs to be 
written back to the slow memory. Data is transferred between fast and slow memory in blocks. The 
lO-complexity of an algorithm is the number of block transfers it performs. 

The two levels in this abstract model can stand for any two consecutive levels in a multi-level 
memory hierarchy: the slow memory could be the disk and the fast memory the main memory, the 
slow memory could be the main memory and the fast memory the L3 cache, and so on. The values 
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of M and B are different at different levels; the higher up in the memory hierarchy, the larger the 
memory size M and block size B. 

Our main result is a variant of the FJ-algorithm for sorted X + Y matrices whose lO-complexity 
is 0(SCAN(n + m)). Here, Scan(s) is the number of lOs performed when scanning s consecutive 
items; Scan(s) ^ 1 + \s/B~\. Our algorithm is cache- oblivious [U], which means it is oblivious of 
the parameters M and B. In other words, the parameters M and B are only used in the analysis of 
the algorithm; they are not used in the algorithm itself. The beauty of cache-oblivious algorithms 
is that, since they do not depend on the values M and B, they are lO-efficient for all values of M 
and B and, hence, lO-efficient at all levels of a multi-level memory hierarchy]^ 

2. The FJ-Algorithm 

First, we give a rough outline of the FJ-algorithm [7j. A detailed description is given in Figure [!} 

Let X[0..n — 1] and y[0..m — 1] be two input arrays of real numbers, given in sorted order: -'^[O] ^ 
X[l] ^ • • • ^ X[n - 1] and Y[0] ^ Y[l] ^ • • • ^ Y[m - 1]. Let A[0..m - l][0..n - 1] be the matrix 
X + Y, that is, the matrix defined by = X[i] + Y[j]. We assume that m = n and that re is a 

power of 2; this can easily be ensured by implicitly padding the arrays X and Y suitably. 

Following Frederickson and Johnson, we call a submatrix of A a cell. The algorithm maintains a 
set C of active cells, such that the desired element will be present in one of the active cells. Initially, 
the entire matrix A is the sole active cell. 

The algorithm proceeds in Ign iterations. Let Cp denote the set of active cells at the beginning 
of the pih iteration, where p = 1,2,. . .,lgre. The pth iteration begins by splitting each cell of Cp into 
four smaller cells by bisecting each dimension. Let C* denote the list of cells obtained by splitting 
each cell of Cp into four. The algorithm next discards certain cells from C* which do not contain 
the desired element, thus obtaining the set Cp+i to be used in the next iteration. 

Cells are discarded based on their minimum and maximum elements. A cell C £ C* for which 
min(C) is larger than a certain number of other minima can safely be discarded because all elements 
of C will be larger than the desired element. Similarly, a cell C € C* for which max(C) is smaller 
than a certain number of other maxima can be discarded because all elements of C will be smaller 



than the desired element. The exact condition for discarding cells is given in step (2b) of the 
algorithm in Figure [T] 

The cells in Cp have size x (n/2P^^) and the cells in C* have size (re/2P) x (re/2P). Hence, 

after iteration p = Igre, the cells in Cp are singletons (that is, 1 x 1 cells). The classical selection 
algorithm is then used to find the desired element among these singletons. 

The following theorem stating the performance of the FJ-algorithm is a special case of the general 
theorem proved by Frederickson and Johnson [7]. 

Theorem 1. [7^ Given two sorted arrays X and Y , each of size re, the FJ-algorithm correctly 
computes an element of rank k in the matrix A = X + Y in 0{n) time. 

3. IO-Efficient Selection 

Next, we show how to make the algorithm of the previous section lO-efficient. Henceforth, we 
will refer to the slow memory in our two-level hierarchy as the disk and to the fast memory as the 
cache. We assume that the array X is laid out in order in re consecutive memory locations on disk. 
Similarly, the array Y is laid out in order in re consecutive memory locations on disk. 

The FJ-algorithm needs an efficient selection algorithm in steps (2b), (2c), and ([3|. Fortunately, 
the standard selection algorithm has good lO-behavior. 



"'^In the analysis of cache-oblivious algorithms it is assumed that the operating system uses an optimal block 
replacement strategy — see the paper by Frigo et al. [9] for a justification of this and some other assumptions in the 
model. 



CACHE-OBLIVIOUS SELECTION IN SORTED X + Y MATRICES 



3 



FJ-algorithm(X, Y, k): 

(1) Initialize Ci such that its only cell is the entire matrix A = X + Y . 

(2) for p := 1 to Ign do 

(a) Split each C £ Cp into four subcells to obtain the set C*. Let Lp := min{n, 2^+^ - 

(b) Let q := \k4P/n'^] + Lp. 

then Use a standard selection algorithm to select a qth element x„ in the multiset 
{min(C) : C £ C*}. Discard |C*| — q + 1 cells from C*, retaining every cell 
C with min(C) < and no cell with min(C) > Xu- 



!}• 



\k4P/ 



n 



Lr. 



(c) Let r : = 
if r ^ 1 

then Use a standard selection algorithm to select an rth element xi in the mul- 
tiset {max(C) : C G C*}. Discard r cells from C*, retaining every cell C 
with max(C) < xi and no cell with max(C) > xi. 

(d) Let k:=k- r{n^/4P) and let Cp+i := C*. 

(3) Select the kth element from the cells in Cp using a standard selection algorithm. 

Figure 1. The matrix selection algorithm of Frederickson and Johnson [7j. 



Lemma 2. The standard selection algorithm [4J selects an element of a given rank k from an array 
of s elements in 0{s) time and using 0(Scan(s)) IOs. 

Even though selection is the main subroutine used by the matrix selection algorithm, Lemma |2] 
does not imply that the FJ-algorithm is lO-efficient. The main problem is that maintaining the list 
of active cells can dominate the lO-cost of a naive implementation of the FJ-algorithm, leading to 
0(n) lO-complexity rather than 0(SCAN(n)). To make the algorithm lO-efficient, we need to take 
a detailed look at the manipulation of active cells. 

The FJ-algorithm needs a data structure to store the sets Cp and C* of active cells. One could 
use linked lists, but traversing a linked list is not lO-efficient because adjacent list elements could 
be stored in different blocks, requiring as many as one lO-operation per list element. Instead, we 
use arrays, which can store any list L compactly on disk in 0{\L\/B) blocks. 

We represent a cell A[ji..j2 — l][ii..i2 — 1] by the 8-tuple 

{ii,n,i2,j2,x[ii],x[i2 - i],Y[n],Y[j2 - 1]) , 

and we identify a cell with its corresponding 8-tuple. The active cells are stored in lexicographic 
order of their corresponding 8-tuples. From the 8-tuple representing cell C we can compute 
min(C) = X[ii] + Y[ji] and max(C) = X[i2 — 1] -|- Y[j2 — 1] in 0(1) time and no additional 

and ([3| of the FJ-algorithm can ah be done in 0(Scan(|C*|)) IOs. 
where we compute C* from Cp by splitting each cell into four subcells. 
iF'the cell {ii, ji,i2, j2, X[ii], X[i2],Y[ji],Y[j2]). Let im = (n +«2)/2 and 



IOs. Hence, steps (|2b|), ^2c 

pal 



The problem lies in step ( 
Suppose we have to sp 
let jm — {ji + j2)/2. The four subcells we must generate are as follows: 
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x[zi],x[z,„-i],y[ji],r[j™-i]) 

[il, jm, hyi, 32, X[ii], X[i,n - 1] , F , y [j2 - 1]) 
[im, jl,i2, im, X[i,n], X[i2 - 1] , F [ji] , F [j™ - 1]) 
{im, jm,i2, 32, X[i„^], X[i2 - l],Y[3m],Y[j2 - 1]) 
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Most components of the subcells can be computed from the components of C, except that X[iy 
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and need to be fetched from the array X, and Y[jm — 1] and l'[jm] need to be fetched from 

the array Y. If we are not careful, fetching these values will cost us an 10 each time and the whole 
algorithm will not be lO-efficient. Next we describe how to overcome this problem. 

Let us examine in what order the algorithm accesses the array X; the array Y will be discussed 
later. 

First consider the elements from X needed for the fifth component of the cells, which stores the 
minimum X-value in the cell. In the initialization step, the entire matrix A is the only active cell; its 
minimum X- value is X[0]. In the first iteration {p = 1) we split A into four subcells. The minimum 
X- values in those subcells are either X[0] (for the north-west and south-west subcells) or X[n/2] 
(for the north-east and south-east subcells). Since X[0] is conveniently stored in the original cell, 
we only need to access X[n/2]. In the second iteration, we need to access X[n/4] and X[3n/4]. In 
general, in the pth iteration {1 ^ p < Ign), each active cell in Cp has dimension {n/2P~^) x (n/2P~^), 
and the elements that need to be accessed to obtain their minimum X-values are X[(2i — 1) • (n/2^)] 
for 1 ^ i ^ 2^~^. (In fact, we do not necessarily need all these elements, since not all cells have to 
be active.) 

To enable lO-efficient access to these elements in X, we construct an array Xi[l..n/2 — 1] that 
stores, for any p with 1 ^ p < Ign, the elements needed in the pth iteration consecutively. Thus we 
define array Xi so that it has the following property: 

For all p in the range 1 ^ p < Ign, for all i in the range 1 ^ i ^ 2^^^, we have 

X^[2P-'+^-l]=x[{2i-l)^]. (1) 

Note that, together with X[0], the elements in Xi are exactly the elements in X at even- numbered 
positions. 

Similarly, the elements that need to be accessed to obtain the maximum X-values in the p-th 
iteration, namely X[(2i — 1) • {n/2P — 1)] for 1 ^ i ^ 2^"^, are stored in an array X2. Thus array 
X2[l..n/2 — 1] stores the odd-numbered elements in X (except X[n — 1]), as follows: 

For all p in the range 1 ^ p < Ign, for all i in the range 1 ^ i ^ 2^^^, we have 

X,[2P-'+^-l]=x[{2^-l)^-l\. (2) 

Next we show how to compute the array Xi efficiently; X2 can be computed similarly. 

Given an integer i, the bit-reversal of i is the integer /3(i) such that the binary string representing 
/3(z) is the reverse of the binary string representing i. The bit-reversal permutation Z' of an array Z 
is the permutation that maps that Z[i] to Z'[(3{i)]. The bit-reversal permutation can be computed 
recursively as follows: Copy all elements in even- numbered positions in Z in order to the first half 
of the array Z\ and copy all elements in odd-numbered positions of Z in order to the second half 
of Z'] recurse on both halves. 

Now suppose we only recurse on the first half of the array Z'; the elements in the second half 
are kept in the same relative order as in the input array Z. We call the resulting permutation the 
partial bit reversal. As we will show below, the partial bit reversal of array X is closely related to the 
array Xi that we want to compute. The recursive algorithm PBR given in Fig. |2] — a non-recursive 
version would also be possible — computes a partial bit reversal Z' of a given array Z[0..n — 1]. In 
the initial call, Z' is a copy of Z, and s = n. (Recall that we assumed n is a power of 2.) 

The following lemma, which gives the running time and 10 complexity of PBR, follows easily 



from the fact that steps 2a and 2b of PBR are just linear scans of arrays Z' , Even, and Odd, so 



these steps run in 0{s) time and 0(Scan(s)) IOs. 



Lemma 3. PBR{Z' ,n) runs in 0{n) time and uses 0(SCAN(n)) IOs. 
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Algorithm PBR(Z',s): 

(1) if s > 1 

(2) then Comment: Even[0..s/2 — 1] and Odd[0..s/2 — 1] are auxiliary arrays. 

(a) for i := to s — 1 do 

if i is even then Even[i/2] := Z'[i] else Odd[{i - l)/2] := Z'[i] 

(b) for i := to s/2 - 1 do Z'[i\ := Even[i] 

for i := s/2 to s - 1 do Z'[{\ := Odd[i - s/2] 

(c) PBR(Z',s/2) 

Figure 2. Algorithm to compute a partial bit-reversal permutation 

The next lemma shows the correspondence between the partial bit reversal of our input array X 
and the array Xi we want to compute. It implies that Xi can be obtained by computing the partial 
bit reversal X' of X and then taking the elements from X'[l..n/2 — 1] in order. 

Lemma 4. Let Z' be the partial bit reversal of an array Z[0..n — 1], where n is a power of 2, as 
computed by PBR. Then for all p in the range 1 ^ p < Ign, for all i in the range 1 ^ i ^ 2^~^, we 
have 

Z' [2P-^ + i-l]=Z 



{2i - 1) 



Proof. Let j > be an even index, and let £ ^ 1 and k ^ 1 he such that j = {21 — 1)2^ . Now 
consider what happens to element Z[j\. Initially Z' is a copy of Z, so Z[j] is stored in Z'[j]. Then, in 



the first call to PBR — that is, the call with s = n — it will be moved to Z'[j /2] by steps (2a) and 2b 
In the recursive call with s = n/2 it will be moved to Z'[j'/4] (if k > 1). This process continues k 
times, until the recursive call is made with s = n/2}^ . At this point Z\i\ is stored in Z'\2i — 1], and 



step (2a) moves the element to Z'\n/2}^^^ + ^ — 1]. After that it will not be moved anymore by the 
algorithm. 

Now set i = I and take p such that 2'^ = n/2^ . Then n/2^^^ = 2^^-*^ and we can conclude that 
Z[{2i - 1) • (n/2P)] ends up in Z'[2P-^ + i - 1], as required. □ 

In what follows, we use Piij) to denote the position of X[j] in the array Xi, for j > and j even. 
Thus, according to Equation ([T|, we have /3i((2z — 1) •(n/2P)) = 2P~^ + i — l. Similarly, P2{j) denotes 
the position of X[j] in the array X2, for j < n—1 and j odd; thus /32((2i — l)-(n/2^') — 1) = 2^"^+^ — 1. 



The arrays Xi and X2 give us the X[-]-values in the order they are needed by the cell-partitioning 
step of the FJ-algorithm. However, to partition a cell we also need to fetch new Y[-] values. For this 
we would like to use the same approach: compute in a preprocessing step two arrays yi[l..n/2 — 1] 
and Y2[\-..n/2 — 1], which contain the y[-]-values in the order needed by the algorithm. With the 
X[-]-values this approach was possible, because the cells in Cp are kept in lexicographical order, 
with the zi-value being dominant. Hence, we knew exactly not only which X[-]-values were needed 
in the p-ih. iteration (namely X[{2i — 1) • (n/2P)] for 1 ^ i ^ 2^"^), but also in which order (namely 
according to increasing index). But for the y[-]-values we only know which values we need in the 
p-th iteration; we do not know in which order we need them, because the ii-coordinate is dominant 
in the order of the cells in Cp. Next we will show that the approach works nevertheless. Thus we 
compute arrays Yi and Y2 in exactly the same way as the arrays Xi and X2 were computed. Then 
we partition the cells with the algorithm shown in Figure [3j 

Before we can prove that this algorithm is indeed lO-efficient, we need to deal with one subtlety: 
we need to be more specific about the exact implementation of step (2b) of the FJ-algorithm in 
case the gth element, Xu, is not unique. More precisely, we need to specify which of the cells C 
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PARTITION(Cp, Xi,X2,Yi,Y2,p): 

(1) Let Cp^R and Cp^L be two arrays of twice the size as Cp. 

(2) for i := to \Cp\ - 1 do 

Let C = {iiJi,i2,j2,X[ii],X[i2 - l],Y[ji],Y[j2 - 1]) be the cell in Cp[i\. 
Let im = (n + «2)/2 and let jm = (ii + i2)/2. 

(a) Fetch X[im - 1] from Xi[Pi{im - 1)] and X[im] from X2[/?2(^m)] 

(b) Fetch Y[j^ - 1] from Yi[pi{j„^ - 1)] and Y[jm] from Y2[(32{jm)] 

(c) Cp,L[2z] ^ (ii, - l],l"[ji],l'[jm - 1]) 

(d) Cp,L[2i + l] ^ {im,jl,i2,jm,X[im],X[i2 - l],Y[ji],Y[jm - I]) 

(e) Cp^R[2i] ^ {h, jm,im, j2, X[ii], X[im - l\,Y[jm\,Y[j2 - 1]) 

(f) Cp,R[2i + l]^{im,jm,i2,j2,X[im],X[i2-l],Y[jm],Y[j2-l]) 

(3) Comment: Now Cp^R and Cp^L together contain the new subcells, and both arrays are sorted 
lexicographically. 

(4) Merge Cp^R and Cp^L into an array C* that is sorted lexicographically. 

(5) return C*. 



Figure 3. Partitioning each cell in Cp into four subcells. 



with min(C) = Xu are discarded and which are kept. Similarly, we must specify which of the cells 
C with max(C) = xi are discarded and which are kept in step (2c). We do this as follows. 



Recall that we maintain C* in lexicographic order. Now we can implement step (2b) by removing 



from Cp exactly those cells whose ranks are greater than q according to this lexicographical order. 
This implies that if we remove a certain cell C, we will also remove all cells to the south-east of C 
(including the ones to the south of C, and the ones to the east of C). We use a similar strategy to 



guarantee that when we remove a cell in step (2c), we also remove all cells to its north-west. With 



this implementation, the active cells have the following properties — see also Figure |3j 

(i) All active cells with the same column index are consecutive. 

(ii) The active cell with the largest row index in a given column — note that row indices increase 
when going downwards in Figure [3] — cannot have row index smaller than the any active cell 
in the column to its right. In other words, if we consider the lowest active cells in each 
column and we consider the columns from left to right, then the the row indices of these 
highest active cells are non-increasing. 

These properties are essential to get good lO-complexity of Partition. 



Figure 4. The structure of the active cells, and the order in which they are accessed 
(which is the lexicographic order). Active cells are white, discarded cells are grey. 
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Lemma 5. Algorithm Partition produces a lexicographically sorted array C* of all subcells re- 
sulting from partitioning every cell in Cp into four. PARTITION runs in 0{\Cp\) time and performs 
0(SCAN(|Cp| +2P)) lOs. 



Proof. The correctness of the algorithm directly follows from the fact that, by definition of f3i and 



/?2, the correct values are fetched in steps (2a) and (2b). 



To bound the running time, we note that and P2{-) can be evaluated in 0(1) time. Indeed, 
when we evaluate e.g. for some j, we know the value of p such that j = {2i — 1) ■ (n/2^') — this 

p is a parameter of Partition. Given p, we have = 2P~^ + (j • (l^/n) + l)/2 — 1. It follows 

that the running time is 0{n). 

As for the number of lOs, all accesses to Cp, as well as step (Q, take 0(SCAN(|Cp| + 2^)) lOs in 
total. Hence, it remains to argue about the accesses to Xi, X2, Yi, and 1^2- 



We first consider the accesses to Xi. As argued earlier, the cells in Cp have size {njl^ ^ 



X 



(n/2P-i), which means we need to fetch from X\ (a subset of) the elements X[(2i — 1) • (n/2P ^)] 
for 1 ^ i ^ 2^^^ . By the definition of X\ — see Equation ([l| — these elements are consecutive in X\. 
Moreover, these elements are accessed from left to right in X\, because the cells in Cp are sorted 
in increasing order of their first coordinate. Hence, all these accesses to X\ take 0(Scan(2P~^)) = 
0(Scan(2P)) IOs in total. Symmetric reasoning gives the same bound on the number of accesses 
to X2. 

Now consider the accesses to Y\\ symmetric reasoning bounds the accesses to 12- Consider 
Figure |3j The active cells will be visited by the algorithm in lexicographic order, as indicated in the 
figure. This means that the algorithm may go back and forth in Y\. Moreover, when going back, the 
algorithm may jump from accessing some element Y\ [j] to accessing another element Y\ [j'] where 
j — j' > 1; we call j — j' the length of the jump. Jumps are significant because each jump may incur 
a cost of one 10 operation. (Jumps are also possible when accessing Xi or X2. Since in Xi and 
X2 we only jump forward, this does not increase the number of IOs there.) Note that the elements 
needed within a single column of active cells, are stored in the correct order in Yi. (Here the term 
"column" refers to a column in the matrix of whose cells are submatrices of size (n/2^) x {n/2'^).) 
When we step from the lowest active cell in one column to highest active cell in the next column, 
however, we may jump in Yi. Now suppose that instead of jumping from one location to the next, 
we visit all intermediate locations as well. Hence, after visiting the new traversal always 

proceeds to either Yi[j — 1] or Yi[j + 1]. We call such a traversal well-behaved. Clearly the number 
of IOs needed by the new traversal of Yi is not more than the number of traversals needed by the 
original traversal. 

The original traversal visited \Cp\ (not necessarily distinct) locations in Yi. We claim that the 
length of the new, well-behaved traversal is 0(|C| + 2^). To show this, we must bound the total 
length of all backward jumps. Consider a backward jump from the lowest active cell in some 
column C to the highest active cell in the next column C . This jump crosses a number of rows. By 
properties (i) and (ii) of the active cells, for each row that is crossed, at least one of the following 
three condition holds: C contains an active cell in this row, C contains an active cell in this row, 
or the row will not be visited again later. This is easily seen to imply that the total length of all 
jumps is 0(|CP| + 2P), as claimed. 

It remains to observe that, assuming M ^ 2B — that is, assuming at least two blocks fit in the 
cache — any well-behaved traversal of length L needs Scan(L) IOs. Indeed, suppose we need to read 
a new block when we step from Yi [i] to Yi -|- 1] . Then we read the block starting at Yi [i -\- 1] and 
can keep the block ending at Yi[i] in cache. Hence, at least B — 1 more forward steps or at least 
B backward steps are needed before another block needs to be read. We conclude that the number 
of IOs performed in accessing Yi (and, similarly, I2) is 0(Scan(|C^| -1-2^'), which finishes the proof 
for the number of IOs. □ 
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Theorem 6. There exists a cache- oblivious implementation of the matrix selection algorithm of 
Frederickson and Johnson for sorted X + Y matrices using 0(SCAN(n)) lOs and 0{n) time, where 
n is the maximum of the lengths of X and Y . 

Proof. By Lemma|3] the computation of the arrays Xi, X2, li, and Y2 takes 0(SCAN(n)) lOs and 
0(n) time. Now consider the main algorithm. Frederickson and Johnson |7] proved that |C^|, the 
number of active cells in the beginning of the pth iteration, is 0(2^). By Lemmas [2] |4] and [5| this 
implies that the total lO-cost is bounded by 

Ign 

^0(Scan(2?')) = 0(ScAN(n)). 
p=i 

Since the subroutine Partition runs in 0(|Cp|), the running time of the main algorithm is un- 
changed from the original FJ-algorithm, which runs in 0(n) time. □ 

4. Conclusion 

In this paper, we gave an lO-efficient cache-oblivious version of the classical matrix selection 
algorithm of Frederickson and Johnson for selecting a rank-A; element in an n x n matrix given 
succinctly in the form A:= X + Y . 

If the matrix A is not square — that is, if its dimensions were mxn where m < n — then a different 
approach seems to be required to make the matrix selection algorithm lO-efficient. One would like 
to obtain an lO-cost of 

/m 2n\ 

O logB — • 

However, we already spend 0((m -|- n)/B) lOs in permuting the input arrays as a pre-processing 
step, which dominates the lO-cost of the subsequent algorithm. It seems difficult to avoid the high 
lO-cost of permuting both input arrays so that they can be accessed lO-efficiently. A completely 
new algorithm may be necessary to achieve lO-optimal matrix selection in sorted X + Y matrices 
that are not square. 
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